Skip to content

Conversation

@srwilson
Copy link

@srwilson srwilson commented Jan 8, 2013

If supervisor crashes, its child processes left running and inherited by supervisor's parent process. I've added a way to send these processes a stop signal if the parent goes away using prctl(PR_SET_PDEATHSIG, SIGTERM). It only works on Linux. On other platforms exceptions are caught and ignored.

I've not run it on anything besides Linux yet.

To test it, start supervisor then send it a kill -9. The child process receives the SIGTERM as expected.

Use prctl(PR_SET_PDEATHSIG, SIGTERM) to send signal when
parent dies
@btubbs
Copy link
Contributor

btubbs commented Jan 11, 2013

+1!

@swistakm
Copy link

This fix would be very useful.

@dexterbt1
Copy link
Contributor

Seen this happen once in our staging setup. This is worth the fix.

@lukeweber
Copy link
Contributor

+1 Thought maybe only running the code when you're on linux might be better than just catching the exception and also added a constant. Happy if you want to merge the change into this PR.

https://github.com/lukeweber/supervisor/tree/zombie-process-fix

@mnaberez
Copy link
Member

#649 is an alternative version of this pull request that adds a check.

@lukeweber
Copy link
Contributor

#649 is an alternative version of this pull request that adds a check.

And I added a constant PR_SET_PDEATHSIG with a comment about where it came from in case anyone's unsure about what 1 does. Also no conflicts to merge. This pull has been open for a long time.

@mnaberez
Copy link
Member

Thanks for the pull request @srwilson and @lukeweber.

First, if you guys have seen supervisord crash, do you know how it crashed and has that been fixed? Certainly it has more bugs and will crash, but let's try to fix whatever condition you saw that prompted this.

About PR_SET_PDEATHSIG:

If supervisord crashes, its subprocesses are reparented, and they should keep running. Wouldn't that actually be desirable in some cases? For example, if you are running a database or cache under it, that may prevent data loss. If supervisord crashes, and PR_SET_PDEATHSIG causes those processes to just get SIGKILLed, they're gone and have no opportunity for clean shutdown. If supervisord crashes today, it probably means a sysadmin has to intervene anyway, and at least now that person can make the call how to deal with it. Does this sound reasonable?

There's not another case in supervisord where we have to dig around with ctypes or has an OS-specific (Linux) feature. I'm not too thrilled about doing that, but if you guys think it will really help your deployment, I'm open to it. How about a setting under a [program:x] section where a PR_SET_PDEATHSIG signal can be specified (default None)? That would preserve the existing conservative behavior, but make this available.

I do agree with a change in @lukeweber's PR, if we know this is a Linux-only thing, we should only attempt it on Linux. Also, most config settings will raise ValueError which halts startup if an invalid option is specified. If we added this option, but the OS didn't support it or the call failed, I think it should raise to abort supervisord startup, forcing the user to notice and correct the config file.

@btubbs
Copy link
Contributor

btubbs commented Aug 19, 2015

I run supervisor as a component in a PaaS (https://bitbucket.org/yougov/velociraptor/) where application instances are typically redundant and load balanced anyway. We can easily tolerate Supervisor dying and taking all its processes with it, but if it dies and leaves zombies behind, that's quite a mess to clean up.

+1 to the PR.

A Linux-specific config would be fine.

@lukeweber
Copy link
Contributor

I don't think I had a specific crash but was playing with things when I was figuring out my code deployment.

Was using unicornherder in supervisor 3.0 and this is what I ran into.

  1. Kill supervisord. unicornherder and gunicorn workers are running.
  2. Start supervisord. Now I notice that I have two unicornherder processes and same worker children(4 total with same pids) as before.
  3. Supervisorctl reports the new pid for the unicornherder, but it seems to communicate with the old works and so does the orphaned unicorn herder pid.

So I've verified besides extra unicornherder processes this isn't really a problem I think because either process does communicate with the workers, but initially I was concerned that I might be sending my signal to the wrong pid because I rely on this to update code(supervisorctl -c $SUPERVISOR_CONF pid $APP_NAME | xargs kill -s HUP) and if it wasn't talking to the right workers it would silently fail.

@mnaberez - I'm not an expert on the process logic, but could you confirm: Even though things are reparented, they aren't managed as the old running process and a new process is launched when supervisord starts?

I think it's a question of whether you want the possibilities of zombies or multiple processes running, which is sometimes even worse than nothing, but agree it's probably best as a judgement call leaving the default as is. I'm thinking of things that might not be safe to run two of here or might be strictly undesirable.

@mnaberez
Copy link
Member

@mnaberez - I'm not an expert on the process logic, but could you confirm: Even though things are reparented, they aren't managed as the old running process and a new process is launched when supervisord starts?

If supervisord crashes because of a bug, or if supervisord is kill -9ed, then the supervisord process is gone but the subprocesses it spawned are still running. They no longer have supervisord as their parent so they are reparented (adopted by the init system). A new instance of supervisord doesn't have any knowledge or control over those processes. A supervisord instance only knows about the children it spawns itself.

Not quite related to this ticket, however: an orphaned process situation like you described can happen even if supervisord doesn't crash. Some programs like gunicorn spawn their own worker processes. supervisord doesn't know about those processes. supervisord only knows about the "master" gunicorn process that it started, not any processes that gunicorn spawned. In this case, you should be sure that you set killasgroup=true in the [program:gunicorn] config. If gunicorn doesn't shut down quickly enough (as defined by stopwaitsecs=) then supervisord will send SIGKILL to it. If gunicorn is killed with SIGKILL, it won't be able to shut down its workers, so they may stay around. Using killasgroup=true tells supervisord to signal the process group, so the gunicorn master and its workers all receive the kill signal.

I think it's a question of whether you want the possibilities of zombies or multiple processes running, which is sometimes even worse than nothing, but agree it's probably best as a judgement call leaving the default as is. I'm thinking of things that might not be safe to run two of here or might be strictly undesirable.

Yeah, if we add support for PR_SET_PDEATHSIG, I'm thinking it's probably safer to let the user decide in the config file if it should be used and how. I have reservations about this PR just setting it to SIGKILL all the time for the reasons I mentioned above.

@lukeweber
Copy link
Contributor

I've updated my pull request to be a config option that can only be set on linux and I added tests for the config.

lukeweber added a commit to lukeweber/supervisor that referenced this pull request Sep 30, 2015
* Original pull inspired by: Supervisor#199
* If the supervisor process dies, that the child process will receive this signal set in prsetpdeathsig.
lukeweber added a commit to lukeweber/supervisor that referenced this pull request Sep 30, 2015
lukeweber added a commit to lukeweber/supervisor that referenced this pull request Sep 30, 2015
@mnaberez
Copy link
Member

mnaberez commented Nov 1, 2015

#649 is an updated version of this PR that includes changes from the discussion above, and still includes the two commits from this one. I am going to close this PR in favor of the new one. I've added a note there so we don't forget to credit @srwilson (the author of this patch) in the changelog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants